Abstract

Multi-criteria decision making has been made pos- 
sible with the advent of skyline queries. However, 
processing such queries for high dimensional datasets 
remains a time consuming task. Real-time applica- 
tions are thus infeasible, especially for non-indexed 
skyline techniques where the datasets arrive online. 
In this paper, we propose a caching mechanism that 
uses the semantics of previous skyline queries to improve
the processing time of a new query. In addition
to exact queries, such special semantics also allow
related queries to be accelerated. We achieve this by
generating partial result sets guaranteed to be in the
skyline sets. We also propose an index structure for
efficient organization of the cached queries. Experiments
on synthetic and real datasets show the effectiveness
and scalability of our proposed methods.

1 Introduction 

To address the problem of multi-criteria decision
making and user preference queries over attributes
in relations where there is no clear preference function,
Börzsönyi et al. [1] introduced skyline queries.
The classic example of a skyline query involves choosing
hotels that are good in terms of two attributes,
price and distance to beach. The query discards hotels
that are both dearer and farther than a skyline
hotel. Formally, for every attribute, there is a preference
function that states which values dominate.

Efficient indexes are difficult to build on relations
available only at run-time or on-the-fly [15]. Hence,
skyline queries suffer from large processing time and
I/O bottlenecks. Caching techniques improve the situation
to some extent. However, traditional tuple and page
caching techniques do not promise significant
improvement for skyline queries, as user interests are
unpredictable and an inexact query with even a slight
modification, where preferences are over a different
subset of attributes, results in a cache miss.
For example, consider the following skyline queries:

select * from Airlines skyline of
Duration min, Cost min, Services max

select * from Airlines skyline of
Duration min, Cost min, Rating max

The new query

select * from Airlines skyline of
Duration min, Cost min

can be answered completely from the cache if the 
results of the previous one are stored and intelligent 
semantic caching techniques are applied. 

The special semantics of the skyline queries allow
such similar or related queries to be processed
mostly from the cache using the results of the previous
queries, without accessing the database. Although
not all skyline queries can be handled so efficiently,
the use of cache does significantly accelerate
them by producing at least partial results, which is
not possible using traditional caching mechanisms.
Our contributions in this paper are as follows: 

1. We introduce the concept of semantic caching 
for skyline queries. 

2. We categorize a new skyline query into four types 
according to the content in the cache and design 
efficient algorithms to process each of them. 

3. We design an index structure for organizing the 
past skyline queries in the cache and show how 
this index helps in searching the cache for pro- 
cessing the new query. 

The rest of the paper is organized as follows. Section 2
reviews previous research on semantic caching
and skyline queries. In Section 3, a cache model is
designed for reusing result sets of previous skyline
queries. Section 4 describes an index structure to
organize and access the semantic descriptions of past
queries efficiently. It also describes the cache replacement
policy. In Section 5, the performance of skyline
caching is examined through experiments. Finally,
we summarize our work and discuss future research
in Section 6.

2 Background and related work 

Consider a relation $R$ with preferences specified for $k$
attributes. A tuple $r_i = (r_{i1}, r_{i2}, \ldots, r_{ik})$ dominates
another tuple $r_j = (r_{j1}, r_{j2}, \ldots, r_{jk})$ (denoted by
$r_i \succ r_j$) if for all $k$ attributes, $r_{ic}$ is preferred or equal to
$r_{jc}$, and for at least one attribute $d$, $r_{id}$ is strictly
preferred to $r_{jd}$. The preference functions for each
attribute are specified as part of the skyline query. A
tuple $r$ is said to be in the skyline set of $R$ if there
does not exist any tuple $s \in R$ that dominates $r$.
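
The dominance definition above can be made concrete with a minimal sketch. The encoding is an assumption made only for illustration and is not taken from the paper: tuples are numeric sequences, and a hypothetical `prefs` dictionary maps an attribute position to "min" or "max". The `skyline` function is the naive quadratic method, not one of the algorithms cited below.

```python
from typing import Dict, List, Sequence

# Hypothetical encoding (not from the paper): a tuple is a sequence of numbers,
# and prefs maps an attribute position to "min" or "max".
Prefs = Dict[int, str]


def preferred_or_equal(x: float, y: float, pref: str) -> bool:
    """True if value x is preferred or equal to value y under pref."""
    return x <= y if pref == "min" else x >= y


def dominates(r: Sequence[float], s: Sequence[float], prefs: Prefs) -> bool:
    """r dominates s: preferred or equal everywhere, strictly preferred once."""
    strictly_better = False
    for attr, pref in prefs.items():
        if not preferred_or_equal(r[attr], s[attr], pref):
            return False
        if r[attr] != s[attr]:
            strictly_better = True
    return strictly_better


def skyline(relation: List[Sequence[float]], prefs: Prefs) -> List[Sequence[float]]:
    """Naive quadratic skyline: keep tuples that no other tuple dominates."""
    return [r for r in relation
            if not any(dominates(s, r, prefs) for s in relation)]


# Hotels as (price, distance to beach); both attributes are minimized.
hotels = [(50, 8), (80, 2), (90, 9), (60, 5)]
print(skyline(hotels, {0: "min", 1: "min"}))  # [(50, 8), (80, 2), (60, 5)]
```

In this toy relation, only the hotel (90, 9) is discarded, since (50, 8) is both cheaper and closer.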

Skyline queries have been imported to databases
from the maximum vector problem or Pareto
curve [ini in computational geometry. The first algorithm
was proposed by Kung et al. [10]. BNL [1] uses
a nested loop approach by repeatedly reading the set
of tuples. SFS [3] improves it by sorting the data
based on a monotone function. LESS [7] combines
the best features of these external algorithms; however,
its performance depends on the availability of
pre-sorted data. A divide-and-conquer approach to
partition the data so as to fit into the main memory
was proposed in [5]. Using index structures, algorithms
such as NN [3] and BBS [TH| have been proposed.

The idea of caching query results to optimize subsequent
query processing was first studied in [5], [11].
Several algorithms that use semantic caching efficiently
and effectively for general applications have been
proposed in [H [T31 21 |S]. Dynamic caching policies
have also been studied [14].

Several intelligent structures, e.g., SkyCube [17]
and compressed skycubes [15], have been proposed to
efficiently compute the varying skyline queries based
on approximate correlated user queries by using the
computational dependencies among related queries.
However, complete construction of these structures
is inefficient in real-time applications. Further, in
caching scenarios, the entire cube may not fit in the
limited cache size. In this paper, we revisit the concept
of semantic caching for skyline queries and propose
novel and intelligent algorithms along with an
indexing scheme.

3 Capturing semantics of skyline queries

In this section, we characterize a skyline query in
terms of previous skyline queries, which helps relate
the new query to those in the cache.

3.1 Characterization of queries 

We assume that all the skyline queries are for a single
relation.^1 We also assume the distinct value condition [17],
which states that if no two data points
have the same values for all the dimensions, then the
skyline result for dimension set $A$ is a subset of the
skyline result for dimension set $B$ when $A \subset B$. Each
query is represented as the set of attributes of skyline
preferences, which we assume is not altered for
a particular dimension. This assumption holds since
the preferences of users are generally the same.^2

^1 For different relations, separate (logical) caches can be
maintained.

Given a cache modeled as a set of queries
$C = \{S_1, S_2, \ldots, S_n\}$, where each cached query $S_j$ is again
a set of attributes, a new query $Q = \{a_1, a_2, \ldots, a_q\}$
can be characterized into at least one of the following
groups:

1. Exact Query: $Q$ is an exact query if it matches
exactly with a cached query, i.e., $\exists S_j, Q = S_j$,
indicating the re-occurrence of a previous query.

2. Subset Query: $Q$ is a subset query if all its
attributes are completely contained in a cached
query, i.e., $\exists S_j, Q \subset S_j$.

3. Partial Query: $Q$ is a partial query if some of
its attributes form a subset of a cached query, i.e.,
$\exists Q' \subset Q, \exists S_j, Q' \subset S_j$.

4. Novel Query: $Q$ is a novel query if none of its
attributes are cached, i.e., if
$\forall a_i \in Q, \forall S_j, a_i \notin S_j$.

The hierarchy of categorization is important for
query processing (details in Section 3.3). The most
restrictive category determines the type of the query.
For example, if a query is both an exact and a subset
query, it is treated as an exact query, and a query is
categorized as a novel query if and only if it cannot
be characterized as an exact, subset or partial query.
When a new skyline is queried, all the semantic
segments stored in the cache are scanned to determine
the type of the new query.

Table 1 describes an example in detail. The contents
of the cache are shown in the top row. The
main rows of the table depict how each query can be
categorized into the different query types. For example,
query $Q_1$ is an exact query because it matches
with $S_2$. It is also a subset query as its attributes
are completely contained within $S_1$. Similarly, it can
be categorized as a partial query since some of its
attributes are contained in the cached queries $S_1$ and
$S_2$. However, it will be treated as an exact query
since that is the most restrictive category. Query $Q_2$
is similarly classified as a subset query even though
it is also a partial query. Query $Q_3$ is a simple partial
query. Query $Q_4$ will also be treated as a partial
query even though one of its attributes (attribute
7) is not cached at all. Query $Q_5$ is a novel query as
it cannot be categorized into any of the three other
types.
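
The categorization rules can be sketched in a few lines. This is an illustrative sketch only: the function name `classify` and the use of Python sets for attribute sets are assumptions, and the cache is scanned linearly as in the unindexed case.

```python
from typing import Iterable, List, Set


def classify(query: Iterable[int], cache: List[Set[int]]) -> str:
    """Return the most restrictive category that applies to the query."""
    q = frozenset(query)
    segments = [frozenset(s) for s in cache]
    if any(q == s for s in segments):
        return "exact"
    if any(q < s for s in segments):   # proper subset of a cached query
        return "subset"
    if any(q & s for s in segments):   # shares at least one attribute
        return "partial"
    return "novel"


# The cache and queries of Table 1.
cache = [{1, 2, 3}, {1, 2}, {3, 4}, {5, 6}]
for query in ({1, 2}, {2, 3}, {4, 5}, {6, 7}, {7, 8}):
    print(sorted(query), classify(query, cache))
```

Checking the categories in order from most to least restrictive is what makes $Q_1$ come out as exact even though it also qualifies as a subset and a partial query.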

3.2 Semantic segments 

While a cached semantic query is simply a set of at- 
tributes, certain other descriptors are also encapsu- 
lated in a data structure called the semantic segment 
for each query. The semantic segment for a query 
contains the following fields: 

• Attributes and preferences: Attributes on which 
the skyline preferences are applied. 

• Result: A link to a table of records that consti- 
tute the answer to this query. 

• Replacement value: It is used for cache replacement
methods (see Section 4.5).



^2 The case where preferences may vary can be handled by
considering each (attribute, preference) pair as a separate
attribute.



3.3 Query processing algorithms 

Based on the type of the new query, different query 
processing strategies are followed as described in this 
section. 



3.3.1 Exact queries 

If the query is an exact query, the result set of the 
cached query is directly returned as the result set of 
the new query. 

3.3.2 Subset queries 

If the new query $Q$ is a subset of a cached query $S_j$,
then the following lemma shows that the result set of
$Q$ is a subset of the result set of $S_j$.

Lemma 1. If a skyline query $Q$ is a subset of another
skyline query $S$, then the result set of $Q$ is completely
contained in the result set of $S$.

Cache: $S_1 = \{1,2,3\}$, $S_2 = \{1,2\}$, $S_3 = \{3,4\}$, $S_4 = \{5,6\}$

Query             Exact    Subset   Partial           Type
$Q_1 = \{1,2\}$   $S_2$    $S_1$    $S_1, S_2$        Exact
$Q_2 = \{2,3\}$   -        $S_1$    $S_1, S_2, S_3$   Subset
$Q_3 = \{4,5\}$   -        -        $S_3, S_4$        Partial
$Q_4 = \{6,7\}$   -        -        $S_4$             Partial
$Q_5 = \{7,8\}$   -        -        -                 Novel

Table 1: Characterization of queries.



Proof. Suppose $Q = \{a_1, a_2, \ldots, a_q\}$. Since
it is a subset of $S$, $S$ can be written as
$\{a_1, a_2, \ldots, a_q, s_1, s_2, \ldots, s_n\}$. Consider a tuple $v$
which is a skyline record for $Q$. Given the distinct
value condition, this implies that there does not exist
any tuple $u$ such that $u$ dominates $v$ in all
the attributes $\{a_1, a_2, \ldots, a_q\}$. Therefore, $u$ cannot
dominate $v$ when more attributes $\{s_1, s_2, \ldots, s_n\}$ are
added. Thus, $v$ is a skyline record for $S$ as well.

However, there can exist a tuple $u$ which is a skyline
record for $S$ but not for $Q$. Assume that $t \succ u$
in $\{a_1, a_2, \ldots, a_q\}$ but $u \succ t$ in $\{s_1, s_2, \ldots, s_n\}$. Since
$u$ is dominated in all attributes of $Q$ by $t$, it is not a
skyline record for $Q$. □

The next lemma shows that to determine whether
a tuple from the result set of $S_j \supset Q$ is in the result
set of $Q$, only the tuples in the result set of $S_j$ need
to be checked for dominance.

Lemma 2. If a tuple $v$ in the result set of $S$ is
not a skyline for $Q \subset S$, then there must exist
$u \in \mathrm{result}(S)$ such that $u \succ v$.

Proof. Suppose $v \in \mathrm{result}(S)$ is dominated in the
attributes of $Q$ by a tuple $t \notin \mathrm{result}(S)$. Since $t$ is
not in the result set of $S$, there must exist a tuple
$u \in \mathrm{result}(S)$ that dominates $t$ in all the attributes
of $S$ including those of $Q$. Thus, $u \succ t$ and $t \succ v$, which
together imply $u \succ v$, which is a contradiction. □

Hence, if none of the tuples in the result set of $S_j$
dominates a tuple $u$ in all the attributes of $Q$, there
cannot exist any other tuple in the relation that can
dominate $u$. Then, $u$ will be in the result set of $Q$.
Otherwise, it will not be.

If a new query $Q$ is a subset of many cached queries
$S_i$, $S_j$, etc., the processing becomes even faster. Any
tuple which is in the result set of $Q$ must be in the
result set of all of $S_i$, $S_j$, etc. Thus, only the tuples
that are in the intersection of the result sets of these
cached queries need to be examined.

While subset and exact queries can be processed 
from the cache itself without accessing the database 
at all, the advantage cannot be retained for the other 
two types of queries as explained next. 
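
The subset-query procedure implied by Lemmas 1 and 2 can be sketched as follows. This is a simplified illustration under stated assumptions: every attribute is minimized, tuples are plain Python tuples indexed by attribute position, and the helper names `dominates_on` and `answer_subset_query` are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[float, ...]


def dominates_on(r: Row, s: Row, attrs: Sequence[int]) -> bool:
    """r dominates s on attrs, assuming every attribute is minimized."""
    return (all(r[a] <= s[a] for a in attrs)
            and any(r[a] < s[a] for a in attrs))


def answer_subset_query(query_attrs: Sequence[int],
                        cached_results: List[Iterable[Row]]) -> List[Row]:
    """Answer a subset query from the cache alone.

    cached_results holds the skyline sets of all cached queries that contain
    query_attrs. Only tuples in their intersection can be in the answer, and
    each candidate needs to be compared only against the other candidates.
    """
    candidates = set(cached_results[0])
    for result in cached_results[1:]:
        candidates &= set(result)
    return [v for v in candidates
            if not any(dominates_on(u, v, query_attrs) for u in candidates)]


relation = [(1, 9, 5), (2, 7, 6), (3, 8, 1), (4, 4, 7), (5, 6, 2), (6, 5, 8)]
# Skyline of the cached query on attributes (0, 1, 2), computed once.
cached = [v for v in relation
          if not any(dominates_on(u, v, (0, 1, 2)) for u in relation)]
print(sorted(answer_subset_query((0, 1), [cached])))
```

The relation itself is touched only to populate the cache; the new query on attributes (0, 1) is answered by filtering the cached tuples.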

3.3.3 Partial queries 

Suppose the new query $Q$ is partial to a cached query
$S_j$. The attributes $Q' \subset Q$ are contained in $S_j$, i.e.,
$Q'$ is equal to some $A_j \subseteq S_j$. Using Lemma 1, the skyline
corresponding to the attributes $Q' = A_j$ is a subset
of the skyline set maintained for $S_j$. This subset is
computed and it serves as the base set. A special case
of partial queries allows the base set to be directly
available: when the query is a superset of $S_j$, i.e.,
$Q' = S_j$. The entire skyline set of $S_j$ then serves as
the base set for $Q$.

Unlike the case for subset queries, the computation
of the base set does not complete the processing. The
following lemma shows that there may exist a tuple that
is not in the base set (i.e., the skyline set for $Q'$), but is
part of the skyline set of $Q$.

Lemma 3. A tuple in the skyline set of Q need not 
be in the skyline set of its subset Q' . 

Proof. Suppose $Q = \{q'_1, q'_2, \ldots, q'_n, q_1, q_2, \ldots, q_m\}$
and its subset $Q' = \{q'_1, q'_2, \ldots, q'_n\}$. Consider a tuple
$v$ that is in the skyline set of $Q$, i.e., there is no
tuple $u$ that dominates $v$ in all the $n + m$ attributes.
However, it may well be the case that $u \succ v$ in the
attributes $q'_1, q'_2, \ldots, q'_n$ while $v \succ u$ in the other
attributes $q_1, q_2, \ldots, q_m$. Then, $v$ will not be a skyline
tuple for $Q'$. □

Thus, the base set alone is not sufficient; it is necessary
to look for tuples that satisfy the skyline criteria
from the database. Computing the base set may then
seem a useless exercise, as scanning the database
cannot be avoided anyway. However, the base set
helps in two important ways.

First, since the tuples in the base set are guaran- 
teed to be in the skyline set of Q, they can be output 
immediately. For real-time applications, the implica- 
tions of this concept of incremental results are enor- 
mous. Without accessing the database at all, some 
skyline records are output; while the user is busy ex- 
amining them, the other skyline tuples can be com- 
puted and fetched from the database. 

The second important advantage is the fact that
the use of a base set can speed up most of the generic
skyline algorithms, such as BNL [1], SFS [3], and
LESS [7]. These algorithms maintain a window of
possible skyline tuples at all times, found by scanning
the database in order. Since the base set fits in the
memory (as it is in the cache) and is guaranteed to
contain only skyline tuples, it can significantly improve
the query processing time by serving as the
initial window. For other non-indexed algorithms,
the base set may or may not help, but will never
degrade the performance.

If there are two or more cached queries $S_i$, $S_j$, etc. to
which $Q$ is partial, base sets can be computed from
all of them. The union of these sets serves as the
consolidated base set which can then be used. Since
this combined base set is at least as large as any of the
individual base sets, the advantages are more pronounced.
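
A minimal sketch of how a base set could seed the window of a BNL-style scan is given below. It is an in-memory simplification made for illustration: the bounded window and temporary files of the actual BNL algorithm are omitted, every attribute is assumed to be minimized, and the names `base_set` and `skyline_with_base_set` are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Row = Tuple[float, ...]


def dominates_on(r: Row, s: Row, attrs: Sequence[int]) -> bool:
    """r dominates s on attrs, assuming every attribute is minimized."""
    return (all(r[a] <= s[a] for a in attrs)
            and any(r[a] < s[a] for a in attrs))


def base_set(cached_result: Iterable[Row], common_attrs: Sequence[int]) -> Set[Row]:
    """Skyline over the shared attributes Q', computed from the cache only."""
    rows = list(cached_result)
    return {v for v in rows
            if not any(dominates_on(u, v, common_attrs) for u in rows)}


def skyline_with_base_set(relation: Iterable[Row], query_attrs: Sequence[int],
                          base: Set[Row]) -> List[Row]:
    """Single-pass window scan whose window starts as the base set."""
    window = list(base)  # can be output immediately as partial results
    for t in relation:
        if t in base:
            continue
        if any(dominates_on(w, t, query_attrs) for w in window):
            continue
        window = [w for w in window if not dominates_on(t, w, query_attrs)]
        window.append(t)
    return window


relation = [(1, 9, 5), (2, 7, 6), (3, 8, 1), (4, 4, 7), (5, 6, 2), (6, 5, 8)]
cached_s = [(1, 9, 5), (2, 7, 6), (4, 4, 7)]   # skyline of cached query on (0, 1)
base = base_set(cached_s, (1,))                 # new query (1, 2) shares attribute 1
print(sorted(skyline_with_base_set(relation, (1, 2), base)))
```

When several cached queries overlap with the new query, the union of the sets returned by `base_set` for each of them can be passed as `base`.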

3.3.4 Novel queries 

Since the novel queries contain attributes on which 
no previous skyline operator has been applied, the 
cache does not contain any information that can be 
used to expedite the processing. Consequently, such 
queries are completely processed from the database. 



3.4 Need for an index structure 

Processing a new query first involves searching all the
semantic segments in the cache to determine its type.
This is a tedious task when the number of semantic
segments is large. As the number of semantic segments
is exponential in the number of dimensions, it can
be very large for high dimensional datasets.

However, there is an even bigger concern when the
semantic segments are not organized. Consider two
cached queries $S_1$ and $S_2$ where $S_2 \subset S_1$. The tuples
that form the result of $S_2$ are already stored in the
result of $S_1$. However, when the semantic segments
are stored naively, these tuples are maintained twice
in the cache, thereby wasting precious cache memory.
The problem is compounded when more queries that
are subsets of $S_2$ are stored.

An efficient organization of the semantic segments
in the cache is thus required: one that avoids storing
redundant records and can retrieve the result set by
comparing with fewer cached queries instead of
comparing with all of them.



4 Index structure 

The index structure that we design is a directed
acyclic graph (DAG) linking the different semantic
segments. The semantic segment for a query $S_1$ is
made a child of the semantic segment for a query $S_2$ if
$S_1 \subset S_2$. Clearly, a semantic segment can have multiple
parents, but there cannot be any cycle. Note that
the graph may be a forest, hence a pseudo root node
is added that acts as the parent of all root nodes to
make it connected. In comparison to SkyCube based
structures, it does not contain the entire gamut of
the user query space and is based only on the queries
previously encountered, thereby fitting within cache
space requirements.

4.1 Modified semantic segments 

To maintain this index structure, in addition to the
fields described in Section 3.2, two more fields are
added to each semantic segment for efficient management
of links among semantic segments:
• Child pointers

• Bit vectors

The child pointers link a semantic segment to its
children. For each attribute of the query, a bit vector
is maintained. The size of the bit vector is equal to
the number of children. The children of a node are
ordered according to their arrival. The $i$-th bit in the
$j$-th bit vector is set to 1 if and only if the $i$-th child
contains the $j$-th attribute. The bit vectors help to
retrieve the required children for an attribute quickly.

The size of the bit vector is not constant; rather, it 
grows or shrinks with the number of children. How- 
ever, since the order of the children is fixed, there is 
no ambiguity about which bit refers to which child. 



4.2 Eliminating redundancy of result 
sets 

The query processing algorithms use the index structure
to eliminate the redundancy of result sets between
a cached query and its subsets. If a query
has a child (i.e., a subset), then not all the skyline
tuples are stored in its result set; rather, they are
distributed between itself and the child. For example,
suppose query $S_1$ has a child $S_2$, which is a leaf
node. The skyline tuples for $S_2$ are stored in its result
set, i.e., $r(S_2) = s(S_2)$. However, since these
records are a subset of the skyline tuples for $S_1$,
redundancy is removed by not storing them again in
$S_1$. Instead, only the difference of the skyline set for
$S_1$ with $S_2$ is stored, i.e., $r(S_1) = s(S_1) - s(S_2)$. The
complete skyline records for $S_1$ can be retrieved by
combining the result set of $S_1$ with that of $S_2$, i.e.,
$s(S_1) = r(S_1) \cup r(S_2)$. In general, when there are
multiple children, the skyline records of all of them
need to be combined to retrieve the result set for the
parent.
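
The split between the stored result set $r(\cdot)$ and the full skyline set $s(\cdot)$ can be sketched as follows. The class name `Segment` and its fields are hypothetical, and the sketch assumes the children are already present when a parent is created.

```python
from typing import Iterable, List, Set, Tuple

Row = Tuple[float, ...]


class Segment:
    """Hypothetical semantic segment that stores only r(S) = s(S) minus its children's skylines."""

    def __init__(self, attrs: Iterable[int], skyline: Iterable[Row],
                 children: Iterable["Segment"] = ()) -> None:
        self.attrs = frozenset(attrs)
        self.children: List["Segment"] = list(children)
        inherited: Set[Row] = set()
        for child in self.children:
            inherited |= child.skyline()
        self.stored: Set[Row] = set(skyline) - inherited   # r(S)

    def skyline(self) -> Set[Row]:
        """Rebuild s(S) as the union of r(S) with the skylines of all children."""
        result = set(self.stored)
        for child in self.children:
            result |= child.skyline()
        return result


s2 = Segment({0, 1}, [(1, 9, 5), (2, 7, 6), (4, 4, 7)])
s1 = Segment({0, 1, 2},
             [(1, 9, 5), (2, 7, 6), (3, 8, 1), (4, 4, 7), (5, 6, 2)],
             children=[s2])
print(sorted(s1.stored))     # [(3, 8, 1), (5, 6, 2)]
print(len(s1.skyline()))     # 5
```

Here the parent keeps only two tuples of its own, and the remaining three are read from the child when the full skyline is requested.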

We next explain how a semantic segment is inserted 
into the index of the cache. Note that a semantic 
segment is inserted only when it is queried. 



4.3 Query processing and insertion 
using index 

We illustrate the index search operation for query
processing and subsequent insertion using the series
of query examples as shown in Fig. 1. In the figures,
only the attributes and the node ids are shown for
simplification.

Initially, the cache is empty and the index simply
contains the pseudo root node. When the first query
{1,2} arrives, it is classified as a novel query, and is
inserted as semantic segment $S_1$ (Fig. 1a).

The next query is {1, 2, 3}. All the root nodes are
searched, and it is found that this new query is
a superset of a cached query. Hence, it is classified
as a partial query, and the entire skyline set of $S_1$ is
used as the base set. The new query now becomes
the root and the old root its child (Fig. 1b).

Then, query {3,4} arrives. Scanning the root
nodes, it is found to be partial to $S_2$. The base set is
computed, which consists of the skyline tuples for the
common attributes, i.e., {3}. This semantic segment
($S_4$) is a subset of both $S_2$ and the new query $S_3$ and
is, therefore, maintained as a child of both (Fig. 1c).

The next query {5,6} is a novel query as it does
not match with any of the root nodes. Consequently,
it is processed from the database and is inserted as a
new root node in the index (Fig. 1d).

Next is an exact query {1, 2}. The roots are
scanned, and it is found to be a subset query of the first
root $S_2$. The children of this root are then searched
to see if the categorization can be improved (as in
this case). The skyline set of $S_1$ is returned as the
answer and no change is made to the index (Fig. 1e).

Query {2,3} then arrives. Being a subset of $S_2$,
only the children of $S_2$ are searched, but no exact
match is found. The skyline set of {2, 3} is computed
from that of {1, 2, 3} and is inserted as a child of $S_2$.
Since the skyline set of {3} is already maintained as
a semantic segment ($S_4$), and it is a subset of this
new query as well, the child pointers and bit vectors
are appropriately modified in $S_2$ and $S_6$ to reflect the
fact that $S_4$ now is only a descendant of $S_2$ and not
a direct child (Fig. 1f).

Queries {4, 5}, {6, 7} and {8, 9} are similarly handled
(Figs. 1g, 1h and 1i).

Figure 1: Querying and insertion of semantic segments in the index.



4.4 Deletion from index 

When the cache is full and a new query arrives, an 
effective replacement policy must be chosen to select 
the replacement candidate. Further, since the cache 
is very dynamic, efficient update operations on the 
index need to be designed. 

The skyline set of a parent in the index is shared
among itself and its children, and the union of these
sets is computed for the result. Therefore, if a child
is deleted from the cache, for correctness, its skyline
set has to be merged back with that of its parent. Since
the size of the skyline set is the largest factor for the
size of a semantic segment, deleting a child does not
produce much advantage. Thus, for our index structure,
we only delete the root nodes, and the children
become the new roots if they have no other parent.

4.5 Cache replacement 

Due to limited cache size, not all semantic segments
encountered can be stored. This is the main drawback
of SkyCube-based techniques. For efficient use
of cache, the most useful semantic segments need to
be preserved and the rest should be replaced.

The first important parameter is the usage factor
($\alpha$). When the semantic segment is first introduced
into the index, its usage factor is set to
1. Every time its result set is used, the value is
incremented. The segment with a lower usage factor
should be replaced, as it is being used less.

The second important factor is the size of the
skyline set ($\beta$), i.e., the number of tuples in it. Since the
available memory in the cache is a premium asset, a
semantic segment that stores a large number of tuples
as its skyline set does not allow other semantic
segments to be stored. Hence, it should be removed.

The third parameter that determines the usefulness
of a semantic segment is dimensionality ($d$).
When the number of dimensions is larger, there is
a higher chance that a new query is a subset of
it, or has more overlap with it in case it is a partial
query. Therefore, such a segment should not be replaced.

Parameter                  Values
Cardinality (N)            1 X 10'', 3 X 10^, 1 X 10^, 3 X 10^ 1 X 10^
Dimensionality (d)         3, 4, 5, 6, 7
Cache size (C)             0.1%, 1%, 3%, 5%, 7%, 10%
Number of queries (|Q|)    1, 5, 10, 25, 50, 100

Table 2: Experimental parameters and their default values (in bold).

A replacement value ($S$) for each semantic segment
is computed by combining the three, i.e.,
$S = f(\alpha, \beta, d)$. The semantic segment with the lowest $S$ is
the least useful and should be chosen for replacement.
The function $f$, therefore, should be monotonic with
$\alpha$ and $d$ and anti-monotonic with $\beta$. While different
functions fit the condition, the following simple
function empirically produces good results:

$S = (\alpha \times d) / \beta$
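
As a small worked example of this formula, the sketch below scores three hypothetical root segments and picks the one with the lowest replacement value. The record type `SegmentStats`, its field names, and the sample numbers are made up for illustration; the sketch also assumes a non-empty skyline set so that the division is defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SegmentStats:
    """Hypothetical bookkeeping for one root segment in the index."""
    name: str
    usage: int          # alpha: starts at 1, incremented on every use
    skyline_size: int   # beta: number of tuples in the skyline set (assumed > 0)
    dims: int           # d: number of attributes in the query


def replacement_value(seg: SegmentStats) -> float:
    """S = (alpha * d) / beta; lower values mark less useful segments."""
    return seg.usage * seg.dims / seg.skyline_size


def pick_victim(roots: List[SegmentStats]) -> SegmentStats:
    """Choose the root segment with the lowest replacement value."""
    return min(roots, key=replacement_value)


roots = [
    SegmentStats("A", usage=5, skyline_size=100, dims=3),   # S = 0.15
    SegmentStats("B", usage=1, skyline_size=400, dims=2),   # S = 0.005
    SegmentStats("C", usage=2, skyline_size=50, dims=4),    # S = 0.16
]
print(pick_victim(roots).name)  # B
```

Segment B is rarely used, low-dimensional and large, so all three factors point toward evicting it first.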

5 Experimental results 

In this section, we evaluate the performance of the 
caching techniques. The techniques were imple- 
mented using Java on an Intel Core 2 Duo 2GHz ma- 
chine with 2GB RAM in Ubuntu Linux environment. 
For skyline computation, we used the non-indexed
sort-filter-skyline (SFS) [3] algorithm. We analyzed
and compared the execution times of three different 
skyline processing techniques: (i) without using cache 
(NC), (ii) using cache without using the index (NI), 
and (iii) using cache with index (Index). 

5.1 Synthetic datasets 

We used the standard data generator for skyline
queries from http://www.pgfoundry.org/projects/randdataset
to generate synthetic
datasets; the dimensions were chosen to be independent.
The scalability and performance of the
techniques on synthetically generated data were
measured against four different parameters: (i) cardinality
of the dataset, (ii) dimensionality of the
dataset, (iii) size of the cache, and (iv) number of
queries. The values of these parameters were varied
according to Table 2. To study the effect of one
parameter, the other parameters were held constant
at the default values shown in bold.

Fig. 2(a) shows the performance of the different
techniques with varying dimensionality. As dimensionality
increases, the cardinality of the skyline
set increases roughly exponentially for independent
datasets [2], [6]. The running time of the non-caching
method more or less shows the same behavior. The
number of semantic segments that need to be maintained
increases exponentially as well. Thus, when no index
is used in the cache, the running time is more
than when the index is used. After d = b, the size of
the cache is not enough to hold all the semantic segments,
and many new queries are classified as novel
queries or partial queries. Consequently, the running
time increases.

Fig. 2(b) shows the effect of the cardinality of the
dataset. For small datasets, the overhead of searching 
through all the semantic segments makes the caching 
method slower than simply processing the skylines 
from the database. For larger datasets, the overhead 
becomes negligible as compared to the gains of using 
the cache; consequently, the non-caching technique 
performs the worst. The indexing technique reduces 
this search overhead and, hence, requires the least 
amount of time for all datasets. 

We next investigate the effect of cache size, measured
as a percentage of the size of the dataset. Since
the non-caching method does not depend on the size,
it is omitted from this experiment. Fig. 3(a) shows
how the running time is affected by varying cache
size. When the size is very small, only a few semantic
segments can be stored. In such situations, indexing
helps only to a small extent. As the size increases,
indexing allows more semantic segments to be stored
because of the way a semantic segment shares its result
set with its subsets. The non-indexing method,
on the other hand, suffers from processing too many
semantic segments without much gain. When the
cache becomes quite large, it allows most of the semantic
segments to be stored along with their result
sets. More queries can now be classified as exact
or subset queries and the performance of the
non-indexing method improves. The performance of the
indexing technique saturates and does not improve
after a point.

Figure 2: Effect of (a) dimensionality, (b) dataset cardinality.

Ideally, when there is enough space in the cache, 
and the system has "seen" all possible skyline queries, 
any new query should be answered very fast. The 
final set of experiments tries to understand this phe- 
nomenon in more detail. 

Fig. 3(b) shows the average running time of a
query as more and more queries arrive. When no
caching is used, the number of queries does not have
any effect, and as expected, the average running time
of a query varies randomly. For the first few
queries, the cache is virtually empty, and processing
the cache yields no hits and no benefit at all.
In fact, the overhead of maintaining the cache worsens
the performance in comparison to the no-caching
technique. Subsequently, as more queries arrive, the
performance improves for the indexing method. However,
when indexing is not used and the semantic segments
are left unorganized in the cache, fewer
semantic segments are stored due to redundancy
of the result sets. This leads to fewer cache
hits, and the performance suffers.

Figure 4: Progressive performance of different techniques
as more queries arrive.

5.2 Real datasets 

We also tested the performance of the techniques
on a real dataset from http://www.databasebasketball.com.
The database provides
the statistics of NBA players with different attributes
such as total points, assists, field goals made, free
throws made, etc. Among these, six different dimensions
were chosen where the data is not missing for
most of the players. The cardinality of the relation
was 19,980. The cache size was set to 5% of that of
the relation.

Figure 3: Effect of (a) cache size, (b) query cardinality.
[Panel (a): x-axis is cache size (|C|), in % of data cardinality.]

The average running time of a query for the different
techniques is plotted in Fig. 4 against the number
of queries. While the time for the non-caching
technique stabilizes after a few queries, that for the
caching methods decreases. Due to the superior
organization of the semantic segments by the indexing
technique, the improvement is more pronounced as
compared to the non-indexing technique.



6 Conclusions 

In this paper, we have introduced the concept of
semantic caching to accelerate a skyline query by
classifying it as one of four types: exact, subset, partial
and novel. While the exact and subset queries are
processed directly from the cache, partial results for
partial queries can be output from the cache before
resorting to the database for the full skyline set. We
also proposed an index structure to effectively organize
the past queries in the cache and improve the
efficiency of the methods. Experimental results on
synthetic and real datasets showed the effectiveness
and scalability of the methods. In the future, we plan to
handle update-intensive databases.